{ "cells": [ { "cell_type": "markdown", "id": "c0cab37c", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "We will use the Transformers library from HuggingFace which is pip-installable:\n", "\n", "pip install transformers\n", "\n", "You'll also probably want to use PyTorch" ] }, { "cell_type": "markdown", "id": "d7bc069c", "metadata": {}, "source": [ "## Exercise 1: Tokenization and Exbedding Exploration\n", "\n", "The aim of this exercise is to visualize how text is broken down into tokens and converted into embeddings. \n", "\n", "1) Create a short ten word sentence\n", "2) Tokenize it using a tokenizer from the Hugging Face model bert-base-uncased\n", "3) Decode the tokens back into words\n", "4) Use the model's embedding layer to project tokens into vectors\n", "5) Visualize the embeddings using PCA" ] }, { "cell_type": "code", "execution_count": null, "id": "3d301c16", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoTokenizer, AutoModel\n", "import torch\n", "import matplotlib.pyplot as plt\n", "from sklearn.decomposition import PCA\n", "\n", "# Load a small model\n", "tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n", "model = AutoModel.from_pretrained(\"distilbert-base-uncased\")\n", "\n", "# Tokenize input\n", "sentence = \"Transformers are amazing models for NLP.\"\n", "tokens = tokenizer(sentence, return_tensors=\"pt\")\n", "input_ids = tokens[\"input_ids\"]\n", "attention_mask = tokens[\"attention_mask\"]\n", "\n", "# Show tokenized inputs\n", "print(\"Input IDs:\", input_ids)\n", "print(\"Attention Mask:\", attention_mask)\n", "\n", "# Decode input IDs back into tokens\n", "decoded_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])\n", "print(\"Decoded Tokens:\", decoded_tokens)\n", "\n", "# Get embeddings\n", "with torch.no_grad():\n", " outputs = model(**tokens)\n", "embeddings = outputs.last_hidden_state.squeeze(0)\n", "\n", "# Reduce dimension for visualization\n", "pca = PCA(n_components=2)\n", "reduced = pca.fit_transform(embeddings.numpy())\n", "\n", "plt.figure(figsize=(8, 5))\n", "for i, label in enumerate(decoded_tokens):\n", " x, y = reduced[i]\n", " plt.scatter(x, y)\n", " plt.text(x + 0.01, y + 0.01, label)\n", "plt.title(\"Token Embeddings Visualized via PCA\")\n", "plt.xlabel(\"PCA 1\")\n", "plt.ylabel(\"PCA 2\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "71b8ea43", "metadata": {}, "source": [ "## Exercise 2: Build Your Own Scaled Dot-Product Attention\n", "\n", "This exercise gets you familiar with the attention mechanism from scratch on small data.\n", "\n", "1) Generate small random matrices for queries, keys, and values\n", "2) Implement the scaled dot-product attention:\n", "\n", "$ Attention(Q, K, V) = softmax \\left( \\frac{QK^T}{\\sqrt{d_k}} \\right) V $\n", "\n", "3) Visualize the attention weights as a heatmap" ] }, { "cell_type": "code", "execution_count": null, "id": "ffb44e94", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Create random Q, K, V\n", "np.random.seed(0)\n", "Q = np.random.rand(3, 4) # Queries\n", "K = np.random.rand(3, 4) # Keys\n", "V = np.random.rand(3, 4) # Values\n", "\n", "# Scaled dot-product attention\n", "d_k = Q.shape[1]\n", "scores = Q @ K.T / np.sqrt(d_k)\n", "weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)\n", "output = weights @ V\n", "\n", "# Print the attention weights and output\n", "print(\"Scaled Dot-Product Scores:\\n\", scores)\n", "print(\"Attention Weights (softmax):\\n\", weights)\n", 
"print(\"Output:\\n\", output)\n", "\n", "# Plot the attention weights as a heatmap\n", "plt.figure(figsize=(6, 5))\n", "plt.imshow(weights, cmap='viridis')\n", "plt.colorbar(label=\"Attention Weight\")\n", "plt.title(\"Attention Weights Heatmap\")\n", "plt.xlabel(\"Key Index\")\n", "plt.ylabel(\"Query Index\")\n", "plt.xticks([0, 1, 2])\n", "plt.yticks([0, 1, 2])\n", "plt.grid(False)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "9b8c3cae", "metadata": {}, "source": [ "## Exercise 3: Multi-Head Attention \n", "\n", "This exercise shows how multi-head attention works by implementing a simplified version with synthetic data.\n", "\n", "Repeat Ex. (2) with a synthetic input of 3 tokens, each with an 8-d embedding and 3 attention heads" ] }, { "cell_type": "code", "execution_count": null, "id": "4b759c6d", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn.functional as F\n", "import matplotlib.pyplot as plt\n", "\n", "# Set dimensions\n", "num_tokens = 3 # sequence length\n", "d_model = 8 # total embedding dimension\n", "num_heads = 2\n", "d_k = d_model // num_heads # dimension per head\n", "\n", "# Synthetic input: 3 tokens, each with 8-d embedding\n", "torch.manual_seed(0)\n", "X = torch.rand((num_tokens, d_model)) # [3, 8]\n", "\n", "# Linear projections for Q, K, V per head (manually for clarity)\n", "def project(X, W):\n", " return X @ W.T\n", "\n", "# Create projection weights: 2 heads, each with separate Q, K, V\n", "W_q = torch.rand((num_heads, d_k, d_model))\n", "W_k = torch.rand((num_heads, d_k, d_model))\n", "W_v = torch.rand((num_heads, d_k, d_model))\n", "\n", "# Compute attention for each head\n", "attn_outputs = []\n", "attn_weights_all = []\n", "\n", "for h in range(num_heads):\n", " Q = project(X, W_q[h])\n", " K = project(X, W_k[h])\n", " V = project(X, W_v[h])\n", " \n", " scores = Q @ K.T / (d_k ** 0.5) # Scaled dot-product\n", " weights = F.softmax(scores, dim=-1)\n", " output = weights @ V\n", " \n", " attn_outputs.append(output)\n", " attn_weights_all.append(weights)\n", "\n", "# Concatenate the outputs from all heads\n", "multi_head_output = torch.cat(attn_outputs, dim=-1)\n", "\n", "# Print the result\n", "print(\"Multi-Head Output:\\n\", multi_head_output)\n", "\n", "# Visualize attention weights\n", "fig, axes = plt.subplots(1, num_heads, figsize=(12, 4))\n", "for i, weights in enumerate(attn_weights_all):\n", " ax = axes[i]\n", " ax.imshow(weights.detach().numpy(), cmap='viridis')\n", " ax.set_title(f\"Head {i+1} Attention\")\n", " ax.set_xlabel(\"Key Index\")\n", " ax.set_ylabel(\"Query Index\")\n", " ax.set_xticks(range(num_tokens))\n", " ax.set_yticks(range(num_tokens))\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "2db7e1b9", "metadata": {}, "source": [ "## Exercise 4: Explore Attention on a Sentence\n", "\n", "Here we will see how each word in a sentence attends to other in context.\n", "\n", "1) Input a sentence into the DistilBERT model\n", "2) Extract the attention weights from one or more layers\n", "3) Use a heat map to visualize attention across words\n", "\n", "Q. In your sentence, which words focus on others\n", "\n", "Q. 
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}